Stratified Sampling Meets Machine Learning

Authors

Edo Liberty

Kevin J. Lang

Konstantin Shmakov

Abstract

This paper solves a specialized regression problem to obtain sampling probabilities for records in databases. The goal is to sample a small set of records over which aggregate queries can be evaluated both efficiently and accurately. We provide a principled and provable solution for this problem; it is parameterless and requires no data insights. Unlike standard regression problems, the loss is inversely proportional to the regressed-to values. Moreover, a zero-cost solution always exists and can only be excluded by hard budget constraints. A unique form of regularization is also needed. We provide an efficient and simple regularized Empirical Risk Minimization (ERM) algorithm along with a theoretical generalization result. In extensive experiments, our approach significantly improves over both uniform sampling and standard stratified sampling, which are the de facto industry standards.
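The setting described in the abstract can be illustrated with a minimal sketch. It assumes independent (Poisson) sampling of records and a Horvitz-Thompson style estimate of a sum query; neither assumption is stated in the abstract. The helper `proportional_probs` is a hypothetical value-proportional heuristic included only for comparison and is not the paper's ERM algorithm. The sketch shows why the cost grows as probabilities shrink (each term scales with 1/p), and why setting every probability to 1 gives zero cost unless a budget forbids it.

```python
import random

def poisson_sample(values, probs, rng):
    """Keep record i independently with probability probs[i] (assumed scheme)."""
    return [(v, p) for v, p in zip(values, probs) if rng.random() < p]

def ht_sum(sample):
    """Horvitz-Thompson estimate of the full sum: scale each kept value by 1/p."""
    return sum(v / p for v, p in sample)

def ht_variance(values, probs):
    """Variance of ht_sum under independent sampling: sum of v^2 * (1/p - 1).

    Zero-valued records contribute nothing and are skipped, so p = 0 is
    tolerated for them.
    """
    return sum(v * v * (1.0 / p - 1.0) for v, p in zip(values, probs) if v != 0)

def proportional_probs(values, budget):
    """Hypothetical heuristic: p_i proportional to |v_i|, capped at 1.

    The expected sample size, sum(p_i), is at most `budget`.
    """
    total = sum(abs(v) for v in values)
    if total == 0:
        return [0.0 for _ in values]
    scale = budget / total
    return [min(1.0, scale * abs(v)) for v in values]

# Illustrative data (made up): many small records and two large ones.
values = [1.0] * 98 + [500.0, 1000.0]
budget = 10

uniform = [budget / len(values)] * len(values)
weighted = proportional_probs(values, budget)

rng = random.Random(0)
estimate = ht_sum(poisson_sample(values, weighted, rng))

print("true sum:", sum(values))
print("one weighted estimate:", estimate)
print("variance, uniform:", ht_variance(values, uniform))
print("variance, weighted:", ht_variance(values, weighted))
print("variance, keep everything:", ht_variance(values, [1.0] * len(values)))
```

On this toy data the weighted probabilities keep the two large records with certainty, which is what drives their variance below that of uniform sampling at the same budget.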

Similar resources

Support Vector Machine based on Stratified Sampling

Support vector machine is a classification algorithm based on statistical learning theory. It has shown good performance in many data mining applications. However, the algorithm has some problems, one of which is its heavy computing cost. As a result, it has been difficult to use the support vector machine in dynamic and online systems. To overcome this problem we propose to u...

Accelerating Minibatch Stochastic Gradient Descent using Stratified Sampling

Stochastic Gradient Descent (SGD) is a popular optimization method which has been applied to many important machine learning tasks such as Support Vector Machines and Deep Neural Networks. In order to parallelize SGD, minibatch training is often employed. The standard approach is to uniformly sample a minibatch at each step, which often leads to high variance. In this paper we propose a stratif...

Calibration for Stratified Classification Models

In classification problems, sampling bias between training data and testing data is critical to the ranking performance of classification scores. Such bias can be both unintentionally introduced by data collection and intentionally introduced by the algorithm, such as under-sampling or weighting techniques applied to imbalanced data. When such sampling bias exists, using the raw classification ...

Convergence Optimization of Backpropagation Artificial Neural Network Used for Dichotomous Classification of Intrusion Detection Dataset

Intrusion detection approaches that use machine learning fall into two categories according to the type of input data. The first comprises network intrusion detection techniques, which consider only data captured in network traffic. The second comprises general intrusion detection techniques, which take in all possible data sources including host-based features as well as n...

Active Learning and its Application to Heteroscedastic Problems

This thesis presents Active Learning algorithms for heterogeneous distributions. Active Learning is a vast and growing sub-field of Machine Learning, where many significant contributions have been made. This thesis makes two contributions to the Active Learning field. The first contribution is a broad survey of Active Learning literature and the second contribution is two new Active Learning algorithms ...

Publication date: 2016
